Pre-submission Review
Monash University
Professor Dianne Cook, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
Dr. Emi Tanaka, Biological Data Science Institute, Australian National University, Canberra, Australia
Assistant Professor Susan VanderPlas, Statistics Department, University of Nebraska, Lincoln, USA
Senior Lecturer Klaus Ackermann, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
Exploring the application of visual inference in regression diagnostics and comparing it with conventional hypothesis tests.
Designing an automated visual inference system to assess residual plots of the classical normal linear regression model.
Deploying the automatic visual inference system as an online application and publishing the relevant open-source software.
This presentation will be focused on the second project.
Diagnostics are the key to determining whether there is anything importantly wrong with a regression model.
\[\underbrace{\boldsymbol{e}}_\textrm{Residuals} = \underbrace{\boldsymbol{y}}_\textrm{Observations} - \underbrace{f(\boldsymbol{x})}_\textrm{Fitted values}\]
Graphical approaches (plots) are the recommended methods for diagnosing residuals.
The vertical spread of the points varying with the fitted values suggests the existence of heteroskedasticity.
However, this is an over-interpretation.
The visual pattern is caused by a skewed distribution of the predictor.
The reading of residual plots can be calibrated by an inferential framework called visual inference (Buja et al. 2009).
Typically, a lineup of residual plots consists of the true residual plot randomly embedded among a set of null plots generated from the reference distribution.
To perform a visual test, observers are asked to pick the plot that looks most different from the others; the null hypothesis is rejected if the true plot is identified.
Modern computer vision models are well-suited for addressing this challenge.
Such models are usually built on a class of deep neural networks called convolutional neural networks (CNNs).
Source: https://en.wikipedia.org/wiki/Convolutional_neural_network
To develop a computer vision model for assessing lineups of residual plots, we need to define a numerical measure of “difference” or “distance” between plots.
Consider the classical normal linear regression model
\[\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]
By the Frisch–Waugh–Lovell theorem,
\[\boldsymbol{e} = \boldsymbol{R}\boldsymbol{\varepsilon},\] where \(\boldsymbol{R}=\boldsymbol{I}_n -\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\).
We treat \(\boldsymbol{e}\) as a vector of random variables here.
The minimization of the sum of squared residuals implies
\[\sum_{i=1}^{n} e_i = 0 \quad\text{and}\quad \text{rank}(\boldsymbol{R}) = n - k,\] where \(k\) is the number of columns of \(\boldsymbol{X}\).
Thus, \(\boldsymbol{e}\) follows a degenerate multivariate distribution.
For simplicity, we will replace \(\text{cov}(\boldsymbol{e}) = \boldsymbol{R}\sigma^2\boldsymbol{R}' = \sigma^2\boldsymbol{R}\) with the full-rank diagonal matrix \(\text{diag}(\boldsymbol{R}\sigma^2)\).
Then for a correctly specified model, \[\boldsymbol{e} \sim N(\boldsymbol{0}_n, \text{diag}(\boldsymbol{R}\sigma^2)).\]
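The relationship \(\boldsymbol{e} = \boldsymbol{R}\boldsymbol{\varepsilon}\) and the diagonal covariance approximation can be checked by simulation. A minimal numpy sketch; the design matrix below is an arbitrary illustrative choice, not the one used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 50, 20000, 2.0

# Illustrative design matrix: intercept plus one predictor.
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Residual operator R = I - X (X'X)^{-1} X'.
R = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Each column of E is one simulated set of residuals e = R @ eps.
E = R @ rng.normal(scale=sigma, size=(n, m))

# Residuals sum to zero (the intercept is projected out), and the
# empirical variances match the diagonal of R * sigma^2.
print(np.allclose(E.sum(axis=0), 0, atol=1e-8))                      # True
print(np.allclose(E.var(axis=1), sigma**2 * np.diag(R), rtol=0.1))   # True
```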
The symbol \(Q\) will be used to represent this reference residual distribution.
However, if the model is misspecified, the actual residual distribution, denoted \(P\), will differ from \(Q\).
For example, if \(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n,\boldsymbol{V})\), where \(\boldsymbol{V} \neq \sigma^2\boldsymbol{I}_n\), \[\boldsymbol{e} \sim N(\boldsymbol{0}_n, \text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R})) \implies \text{Heteroskedasticity}.\]
And if some necessary higher-order predictors \(\boldsymbol{Z}\) are also omitted, \[\boldsymbol{e} \sim N(\boldsymbol{R}\boldsymbol{Z}\boldsymbol{\beta}_z, \text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R})) \implies \text{Non-linearity}~ \& ~\text{Heteroskedasticity}.\]
We defined a distance measure based on the Kullback–Leibler divergence to quantify the extent of model violations:
\[\begin{align} \label{eq:kl-0} D &= \log\left(1 + D_{KL}\right), \\ \label{eq:kl-1} D_{KL} &= \int_{\mathbb{R}^{n}}\log\frac{p(\boldsymbol{e})}{q(\boldsymbol{e})}p(\boldsymbol{e})d\boldsymbol{e}, \end{align}\]
where \(p(.)\) and \(q(.)\) are the probability density functions for distribution \(P\) and \(Q\) respectively.
For a classical normal linear regression model that omits necessary higher-order predictors \(\boldsymbol{Z}\), and incorrectly assumes \(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n,\sigma^2\boldsymbol{I}_n)\) while in fact \(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, \boldsymbol{V})\) with \(\boldsymbol{V} \neq \sigma^2\boldsymbol{I}_n\), the KL divergence can be further expanded to
\[\begin{align} \label{eq:kl-2} D_{KL} &= \frac{1}{2}\left(\log\frac{|\boldsymbol{W}|}{|\text{diag}(\boldsymbol{R}\sigma^2)|} - n + \text{tr}(\boldsymbol{W}^{-1}\text{diag}(\boldsymbol{R}\sigma^2)) + \boldsymbol{\mu}_z'\boldsymbol{W}^{-1}\boldsymbol{\mu}_z\right), \end{align}\]
where \(\boldsymbol{\mu}_z = \boldsymbol{R}\boldsymbol{Z}\boldsymbol{\beta}_z\), and \(\boldsymbol{W} = \text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R})\).
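Since \(\boldsymbol{W}\) and \(\text{diag}(\boldsymbol{R}\sigma^2)\) are diagonal, the closed-form expression reduces to vector operations. A sketch; the design matrix, the omitted term \(\boldsymbol{Z}\), and all parameter values are illustrative assumptions:

```python
import numpy as np

def distance(R, V, Z, beta_z, sigma2):
    """D = log(1 + D_KL) for the diagonal-normal P and Q, following the
    closed-form expression with W = diag(R V R) and mu_z = R Z beta_z."""
    n = R.shape[0]
    w = np.diag(R @ V @ R)          # diagonal of W
    q = sigma2 * np.diag(R)         # diagonal of diag(R sigma^2)
    mu = R @ Z @ beta_z
    d_kl = 0.5 * (np.log(w).sum() - np.log(q).sum() - n
                  + (q / w).sum() + (mu**2 / w).sum())
    return np.log1p(d_kl)

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
R = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
Z = X[:, 1:] ** 2                   # an omitted quadratic term

# No violation (V = sigma^2 I, beta_z = 0) gives distance ~0;
# omitting the quadratic term gives a strictly positive distance.
print(distance(R, np.eye(n), Z, np.array([0.0]), 1.0))      # ~0
print(distance(R, np.eye(n), Z, np.array([2.0]), 1.0) > 0)  # True
```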
For non-normal \(\boldsymbol{\varepsilon}\), \(P\) is unlikely to be a well-known distribution, so its density must be estimated numerically.
To make this tractable, the elements of \(\boldsymbol{e}\) are assumed to be independent of each other.
Under the independence assumption, the \(n\)-dimensional integral reduces to a sum of one-dimensional terms, each estimated by Monte Carlo:
\[\begin{align} \label{eq:kl-3} D_{KL} &\approx \sum_{i = 1}^{n} \hat{D}_{KL}^{(i)}, \\ \hat{D}_{KL}^{(i)} &= \frac{1}{m}\sum_{j = 1}^{m} \log\frac{\hat{p}_i(B_{ij})}{q_i(B_{ij})}, \end{align}\]
where \(B\) is an \(n \times m\) matrix whose \(m\) columns are simulated sets of residuals, \(\hat{p}_i(.)\) is the kernel density estimator of \(p_i(.)\) fitted to the \(i\)-th row \(B_i\), and \(q_i(.)\) is the marginal reference density of \(e_i\).
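The estimator can be sketched with a plain Gaussian KDE using Silverman's rule-of-thumb bandwidth (an assumption here; the study may use a different bandwidth selector). The reference margins are taken to be \(N(0, \sigma^2 R_{ii})\):

```python
import numpy as np

def kde_logpdf(x, samples):
    """Gaussian kernel density estimate evaluated at x,
    with Silverman's rule-of-thumb bandwidth."""
    m = samples.size
    h = 1.06 * samples.std(ddof=1) * m ** (-0.2)
    z = (x[:, None] - samples[None, :]) / h
    dens = np.exp(-0.5 * z**2).sum(axis=1) / (m * h * np.sqrt(2 * np.pi))
    return np.log(dens)

def kl_hat(B, q_sd):
    """Monte Carlo KL estimate under independence: B is n x m with one
    row per observation; q_sd[i] = sqrt(sigma^2 R_ii) for the reference."""
    total = 0.0
    for i, row in enumerate(B):
        log_q = -0.5 * (row / q_sd[i]) ** 2 - np.log(q_sd[i] * np.sqrt(2 * np.pi))
        total += np.mean(kde_logpdf(row, row) - log_q)
    return total

rng = np.random.default_rng(2)
B = rng.normal(scale=2.0, size=(20, 500))   # actual spread twice the reference
print(kl_hat(B, np.full(20, 1.0)) > 0)      # True
```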
Let’s get back to the original problem. We have a fitted model on hand, and we want to know how different the residuals are from a set of good residuals.
However, the data generating process is typically unknown!
We can train a computer vision model to approximate \(D\) from a residual plot
\[\begin{equation} \label{eq:d-approx} \widehat{D} = f_{CV}(V_{h \times w}(\boldsymbol{e}, \boldsymbol{\hat{y}})), \end{equation}\]
where \(V_{h \times w}(.)\) is a plotting function that saves a residual plot as an image with \(h \times w\) pixels, and \(f_{CV}(.)\) is a computer vision model that predicts a distance in \([0, +\infty)\).
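A toy stand-in for \(V_{h \times w}\): the actual pipeline rasterises a saved plot image, but a 2D histogram over fitted values and residuals conveys the idea of mapping point coordinates to a fixed pixel grid:

```python
import numpy as np

def residual_image(e, y_hat, h=32, w=32):
    """Toy stand-in for V_{h x w}: rasterise the residual plot
    (fitted values on x, residuals on y) into an h x w binary image."""
    counts, _, _ = np.histogram2d(y_hat, e, bins=[w, h])
    return (counts.T > 0).astype(np.float32)

rng = np.random.default_rng(3)
y_hat = rng.uniform(size=200)
e = rng.normal(size=200)
img = residual_image(e, y_hat)
print(img.shape)   # (32, 32)
```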
\[\begin{align} \label{eq:data-sim} \boldsymbol{y} &= \boldsymbol{1}_n + \boldsymbol{x}_1 + \beta_1\boldsymbol{x}_2 + \beta_2(\boldsymbol{z} + \beta_1\boldsymbol{w}) + \boldsymbol{k} \odot \boldsymbol{\varepsilon}, \\ \boldsymbol{z} &\propto \text{He}_j(\boldsymbol{x}_1), \\ \boldsymbol{w} &\propto \text{He}_j(\boldsymbol{x}_2), \\ \boldsymbol{k} &= \sqrt{\boldsymbol{1}_n + b(2 - |a|)(\boldsymbol{x}_1 + \beta_1\boldsymbol{x}_2 - a\boldsymbol{1}_n)^2}, \end{align}\]
where \(\text{He}_j(.)\) is the \(j\)th-order probabilist's Hermite polynomial, and the \(\sqrt{(.)}\) and \((.)^2\) operators are applied element-wise.
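The simulation scheme can be reproduced with `numpy.polynomial.hermite_e`; the factor values below are one arbitrary point in the factor space, not the levels used in the study:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(4)
n, j = 300, 2                                   # He_2: quadratic non-linearity
beta1, beta2, a, b, sigma_e = 1.0, 1.0, 0.0, 1.0, 0.5

x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
eps = rng.normal(scale=sigma_e, size=n)

he_j = lambda x: hermeval(x, [0.0] * j + [1.0])  # He_j(x)
z, w = he_j(x1), he_j(x2)

# Element-wise heteroskedasticity factor k.
k = np.sqrt(1 + b * (2 - abs(a)) * (x1 + beta1 * x2 - a) ** 2)

y = 1 + x1 + beta1 * x2 + beta2 * (z + beta1 * w) + k * eps
print(y.shape, bool((k >= 1).all()))   # (300,) True
```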
Factor \(j\)
Factor \(\sigma_\varepsilon\)
Factor \(a\)
Factor \(b\)
Distribution of \(\varepsilon\)
Non-linearity + Heteroskedasticity
Non-normality + Heteroskedasticity
Distribution of predictor
Non-linearity + Second predictor
The architecture of the computer vision model is adapted from VGG16 (Simonyan and Zisserman 2014).
The computer vision model is trained on the M3 high-performance computing platform (www.massive.org.au), using TensorFlow (Abadi et al. 2016) and Keras (Chollet et al. 2015).
The training, validation and test sets contain 64,000, 16,000 and 8,000 images respectively.
The distribution of the target variable \(D\) is controlled such that it roughly follows a uniform distribution.
Multiple models with different image resolutions were trained; the best-performing model has input size \(32 \times 32\).
The model performs worst on null plots (\(D = 0\)), as expected.
For a consistent data generating process, \(D\) typically increases logarithmically with the number of observations.
This behaviour follows from the relationship \(D = \log(1 + D_{KL})\), where \(D_{KL} = \sum_{i=1}^{n}D_{KL}^{(i)}\) under the independence assumption.
Therefore, the Model Violations Index (MVI) for determining the extent of model violations can be proposed as
\[\begin{equation} \label{eq:mvi} \text{MVI} = C + \widehat{D} - \log(n), \end{equation}\]
where \(C\) is a sufficiently large constant that keeps the index positive.
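The index itself is a one-line adjustment of the approximated distance; the constant \(C = 10\) below is an arbitrary illustrative choice:

```python
import numpy as np

def mvi(d_hat, n, C=10.0):
    """Model Violations Index: the approximated distance adjusted for
    the log(n) growth of D; C keeps the index positive."""
    return C + d_hat - np.log(n)

# Comparable violations at different sample sizes map to similar MVIs
# once the log(n) term is removed.
print(round(mvi(2.0, 100), 3))   # 7.395
```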
We have proposed a statistical test based on the approximated distance:
The null distribution can be estimated by predicting approximated distance for a large number of null plots.
The critical value can be estimated by the sample quantile (e.g. \(Q_{null}(0.95)\)) of the null distribution.
The \(p\)-value is the proportion of null plots having approximated distance greater than or equal to the observed one.
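The procedure reduces to sample quantiles of the estimated null distribution. A sketch in which the model predictions are mocked with random draws; in a real application, `null_dist` would hold \(f_{CV}\) outputs on null plots simulated from the fitted model:

```python
import numpy as np

rng = np.random.default_rng(5)

# Mocked approximated distances for 5000 null plots (assumed shape of
# the null distribution, purely for illustration).
null_dist = rng.gamma(shape=2.0, scale=0.5, size=5000)
observed_d = 4.0                    # approximated distance of the data plot

crit = np.quantile(null_dist, 0.95)            # critical value Q_null(0.95)
p_value = np.mean(null_dist >= observed_d)     # proportion >= observed

print(observed_d > crit, p_value < 0.05)       # True True
```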
We have collected \(7974\) responses to \(1152\) lineups in a human subject experiment conducted in the first project (Li et al. 2023).
Differences:
\(MEDV = \beta_0 + \beta_1RM + \beta_2LSTAT + \beta_3PTRATIO + \varepsilon\)
\(y = \beta_0 + \beta_1x + \varepsilon\)
In this study, we have
introduced a distance measure that effectively captures the magnitude of model violations
proposed and trained a computer vision model to approximate this distance
developed a statistical testing procedure and a Model Violations Index based on the approximated distance
The proposed test shows power comparable to conventional tests, but there is room to further improve its robustness against minor deviations from model assumptions.
Our method offers significant value by easing analysts’ workload in assessing residual plots.
While we encourage continued manual review of residual plots for insightful analysis, our approach serves as a valuable tool for automating or supplementing the diagnostic process.
| Date | Event |
|---|---|
| Mar | Finish and submit the second paper |
| Apr | Develop a web interface for the automatic visual inference system |
| May | Clean and publish R packages to CRAN |
| Jul | Finish the third paper |
| | Attend the useR! conference |
| Aug | Submit thesis |
Slides URL: https://patrickli-milestone-3.netlify.app